efforts to achieve practical formant vocoders have been plagued with problems of formant
tracker instability, resulting in unnatural “warbles” in the synthesized speech. A new approach
to formant frequency determination, combined with a digital implementation, promises
to eliminate these effects and to yield a useful formant vocoder. Additional redundancy reduction
of information is obtained by means of a pattern-matching technique, which encodes the
three formant frequencies into seven bits per frame to provide speech synthesis at 600 bps.
This is accomplished by using an existing 2400-bps linear-predictive-encoder (LPE) with some
additional processing. A demonstration record of processed speech at 600 bps is included
with the report.

600-BIT-PER-SECOND VOICE DIGITIZER
(LINEAR PREDICTIVE FORMANT VOCODER)


INTRODUCTION 

This  report  presents  an  analysis/synthesis  method  whereby  speech  may  be  transmitted 
at  600-bits-per-second  (bps),  a data  rate  which  is  less  than  1 percent  of  the  pulse-code- 
modulation (PCM) transmission rate for original speech sounds. This R&D effort was motivated
by the pressing need for very-low-data-rate (VLDR) voice digitizers to meet some
of  the  present  Navy  voice  communication  requirements.  The  use  of  a VLDR  voice  digitizer 
makes  it  possible  to  transmit  speech  signals  over  adverse  channels  which  support  data  rates 
of  only  a few  hundred  bps  or  to  transmit  speech  signals  over  more  favorable  channels  with 
redundancies  for  error  protection  or  for  other  useful  applications.  The  600-bps  synthesized 
speech  loses  some  of  its  original  speech  quality,  but  the  intelligibility  is  sufficiently  high  to 
permit  the  use  of  the  system  in  certain  specialized  military  applications. 

One of the most attractive features of the VLDR voice-digitizer technique presented
in this report is that it is a simple extension of a 2400-bps linear predictive encoder (LPE)
(Fig. 1) which has been under intensive investigation by the Navy and various other government
agencies and is presently entering advanced development. It is anticipated that
2400-bps LPEs will be extensively deployed in support of DOD and other government
communications. In essence the 600-bps voice digitizer is a 2400-bps LPE with an add-on
processor at the transmitter and the receiver. This add-on processor converts the 2400-bps
speech data to 600-bps speech data at the transmitter and reconverts the data to 2400
bps at the receiver.

To elaborate this point, the parameters encoded by a typical 2400-bps LPE will be
briefly reviewed. An LPE derives two sets of parameters from speech waveforms. One
is a set of predictive coefficients, estimated by the least-squares method, which describes
the signal transformation characteristics of the vocal tract. The other set describes the
excitation waveforms, i.e., pitch period, power level, and voiced/unvoiced decision (buzz/hiss
selection), that define the driving signal for the vocal-tract filter. The vocal-tract filter
is a digital recursive filter in which the filter parameters are the predictive coefficients. In a
typical 2400-bps LPE all of these parameters are derived once every 22.5 milliseconds and
quantized in 54 bits.

With respect to the parameters transmitted by a 2400-bps LPE, the following modifications
are incorporated in the 600-bps voice digitizer:

• The parameter update interval is increased from 22.5 to 25 milliseconds. This increase
of roughly 10 percent is an undesirable but necessary compromise between the 2400-bps LPE
update rate and the number of bits per frame to realize an overall data rate of 600 bps.

• The excitation parameters are virtually identical to those for a 2400-bps LPE, but
the pitch period is updated once every other frame. Transmission of the pitch period and
the excitation power level requires 260 bps, or 43 percent of the overall data rate. This
high percentage of the transmission rate is considered necessary, since the pitch period
and the excitation power level are essential for natural speech reproduction, which is so
vital to acceptable voice communications.

• The  vocal-tract-filter  parameters  take  two  forms  depending  on  the  voicing  state: 
the  formant  frequencies  for  voiced  sounds,  and  predictive  coefficients  for  unvoiced  sounds. 
Voiced  sounds  (mostly  vowels)  are  well  characterized  by  the  impulse  response  of  the  vocal- 
tract filter having three resonance frequencies (the first three formant frequencies). Therefore,
if speech is voiced, three formant frequencies, derived from predictive coefficients,
are  transmitted.  To  economize  the  data  rate,  neither  the  formant  bandwidths  nor  the 
formant  intensities  are  transmitted.  On  the  other  hand,  predictive  coefficients  are  directly 
transmitted  for  unvoiced  sounds  (fricatives),  because  they  are  poorly  characterized  in  terms 
of  formant  frequencies. 
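The frame budget implied by these figures can be checked with a few lines of arithmetic. The following Python sketch uses only numbers quoted in the text; the variable names are illustrative and not taken from the report.

```python
# Frame-budget arithmetic using only figures quoted in the text.
LPE_BITS_PER_FRAME = 54      # bits per frame in a typical 2400-bps LPE
LPE_FRAME_S = 0.0225         # 22.5-ms update interval
VLDR_RATE_BPS = 600          # target data rate
VLDR_FRAME_S = 0.025         # 25-ms update interval
EXCITATION_BPS = 260         # pitch period plus excitation power level

lpe_rate = LPE_BITS_PER_FRAME / LPE_FRAME_S             # bits per second
vldr_bits_per_frame = VLDR_RATE_BPS * VLDR_FRAME_S      # bits per frame
excitation_share = EXCITATION_BPS / VLDR_RATE_BPS       # fraction of the rate
excitation_bits_per_frame = EXCITATION_BPS * VLDR_FRAME_S

print(round(lpe_rate), round(vldr_bits_per_frame, 3))
print(round(100 * excitation_share), excitation_bits_per_frame)
```

The average of 6.5 excitation bits per frame is not a whole number, which is at least consistent with the pitch period being sent only every other frame.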

Figure  2 is  a block  diagram  of  the  600-bps  voice  digitizer.  The  blocks  bordered  by 
the  double  lines  indicate  the  processors  added  to  a 2400-bps  LPE  to  provide  a 600-bps 
transmission  capability. 

The most critical process in the 600-bps voice digitizer is formant tracking. The majority
of previous formant-tracking methods relied on some form of spectral analysis of
the speech waveform, which is in essence the evaluation of the vocal-tract-filter transfer
function along the unit circle in the z plane. Although the spectral analysis is relatively
simple, it is often unable to detect the vocal-tract-filter poles located well inside the unit
circle; i.e., the frequency spectrum does not peak sharply at the frequencies corresponding
to the arguments of poles. Likewise two sets of closely adjacent poles are often detected
as one pole, leading to a formant-misidentification problem. Such phenomena are commonly
observable in speech spectrographs. Thus it is not surprising that formant tracking
has long been regarded as impractical (the Background section will give additional
information).

The 600-bps voice digitizer described in this report uses predictive coefficients for
formant tracking. The use of predictive coefficients as source material for formant tracking
has merit because the coefficients appear in the expression of the vocal-tract-filter
transfer function as a simple algebraic form:

$$\frac{1}{1 - a_{1|n} z^{-1} - a_{2|n} z^{-2} - \cdots - a_{n|n} z^{-n}},$$

where the a's are predictive coefficients, z is a complex variable, and n is the total number
of filter coefficients (the order of prediction). The roots of the denominator provide the
poles of the vocal-tract filter. The arguments of the poles are linearly related to the formant
frequencies, and the moduli of the poles are logarithmically related to the formant
bandwidths. Extraction of these roots requires polynomial factorization, which has been
well explored and documented through the past two centuries. However there are two
reasons for avoiding this process in determining the formant frequencies. First, it requires
complex arithmetic and, usually, high-precision computations. Second, the 600-bps voice
digitizer does not require the formant bandwidth information. Thus the use of polynomial
factorization (which provides such information) would not be fully justified unless computations
are simple. Since this is not the case, a simple alternative method of estimating
formant frequencies was chosen in the present 600-bps voice digitizer.

This method proceeds in two steps: first an initial approximation and then a subsequent
refinement. The first step moves all the poles toward the unit circle in the z plane,
so that a simple spectral analysis can provide all the formant frequencies as the initial
starting point for the second step. The poles are moved toward the unit circle by simply
letting the last predictive coefficient $a_{n|n}$ (the product of all pole moduli) be near unity.
Thus each individual pole modulus approaches unity, which implies that all the poles are
near the unit circle.
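The link between the last predictive coefficient and the pole moduli can be checked numerically. The following Python sketch is an illustration only: the three pole pairs, their radii, the 8000-Hz sampling rate, and the helper name are all assumed for the example. With the denominator written as z^n - a_1 z^(n-1) - ... - a_n and the poles in complex-conjugate pairs, the equality holds in magnitude, and the coefficient itself comes out negative.

```python
import cmath
import math

def denominator_from_poles(poles):
    """Expand prod(z - z_i) into coefficients, highest power of z first."""
    coeffs = [1 + 0j]
    for p in poles:
        new = coeffs + [0j]              # multiply the polynomial by z ...
        for k in range(len(coeffs)):
            new[k + 1] -= p * coeffs[k]  # ... and subtract p times the polynomial
        coeffs = new
    return coeffs

FS = 8000.0                              # assumed sampling rate, Hz
# Hypothetical (frequency in Hz, pole radius) pairs standing in for three formants.
pairs = [(500.0, 0.97), (1500.0, 0.95), (2500.0, 0.93)]

poles = []
for freq, radius in pairs:
    z = cmath.rect(radius, 2 * math.pi * freq / FS)
    poles += [z, z.conjugate()]

coeffs = denominator_from_poles(poles)
# Denominator is z^n - a_1 z^(n-1) - ... - a_n, so a_i = -coeffs[i].
a = [-c.real for c in coeffs[1:]]
a_last = a[-1]
moduli_product = math.prod(abs(z) for z in poles)

print(a_last, moduli_product)
```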

If the poles move radially when $a_{n|n}$ approaches unity, then the formant frequency
is exact. Generally however the poles do not move radially (the ideal case) as $a_{n|n}$ approaches
unity; therefore the formant frequencies are shifted from their true values. These shifts do
not appear to be excessive for voiced sounds. This first step produces two useful results:
all formant frequencies are distinct and naturally ordered (they are separated as f1, f2,
and f3), and all formant frequencies are always captured. These two results are most beneficial
for accomplishing successful formant tracking.

The second step is the refinement of these initial formant frequency estimates. As
$a_{n|n}$ moves toward its actual value, the frequency response is recomputed for a small range
around the previous formant frequency. Since $a_{n|n}$ lies theoretically between -1 and 1
(in most cases somewhere between -0.5 and 0.5), a few iterations at an incremental step
of -0.2 will find a sufficiently accurate formant frequency. This procedure is applied to
all formant frequencies, and if any formant frequency disappears during the iteration, its
previous value is retained.

When determination of the three formant frequencies is complete, the frequencies
must be coded into seven bits to meet the data-rate limitation. The remaining eight bits
per frame are allocated to the excitation parameters and synchronization. Normally ten
bits per frame are required for coding three formant frequencies (three bits for f1, four
bits for f2, and three bits for f3). The most effective way of coding three formant frequencies
into seven bits is by pattern matching (by coding the three formant frequencies
jointly). Fortunately certain combinations of formant frequencies do not occur, a characteristic
which permits a pattern-matching technique to exclude these classes in the codes.
Thus the 128 formant patterns (2^7 patterns) are selected from many speech samples through
a technique similar to “cluster analysis.” Similarly, the six predictive coefficients are classified
into 128 patterns for the unvoiced case.
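The joint coding of three formant frequencies into one seven-bit index can be pictured as a nearest-pattern lookup. The Python sketch below is a toy illustration under stated assumptions: the 128-entry table is a made-up grid rather than patterns selected from speech, and the squared-distance measure is an assumption, since this section does not specify the matching rule.

```python
def build_toy_codebook():
    """Hypothetical 128-entry table; real patterns would come from speech data."""
    book = []
    for i in range(128):
        f1 = 250 + 50 * (i % 8)            # 8 assumed values for f1, Hz
        f2 = 900 + 200 * ((i // 8) % 4)    # 4 assumed values for f2, Hz
        f3 = 2200 + 200 * (i // 32)        # 4 assumed values for f3, Hz
        book.append((f1, f2, f3))
    return book

def encode(formants, codebook):
    """Return the index of the closest pattern (assumed squared-error measure)."""
    def distance(index):
        return sum((f - c) ** 2 for f, c in zip(formants, codebook[index]))
    return min(range(len(codebook)), key=distance)

def decode(index, codebook):
    """Recover the formant triple associated with a transmitted index."""
    return codebook[index]

codebook = build_toy_codebook()
index = encode((310.0, 1480.0, 2550.0), codebook)   # transmitter side
restored = decode(index, codebook)                   # receiver side
print(index, restored)
```

Because the index selects one of 128 entries, it always fits in seven bits, whatever the contents of the table.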

At  the  receiver  the  formant  frequencies  are  converted  to  six  predictive  coefficients 
and  become,  as  in  a 2400-bps  LPE,  the  weights  of  the  vocal-tract  filter. 

The  subsequent  sections  of  this  report  discuss  the  past  history  of  formant  tracking, 
previous  600-bps  voice  digitizers,  and  the  implementation  of  the  present  600-bps  voice 
digitizer.  A demonstration  record  containing  samples  of  600-bps  speech  is  included  with 
the  report. 


BACKGROUND 

Both  formant-tracking  vocoders  and  600-bps  voice  digitizers  have  existed  for  some 
time. This section presents some of their history. In addition the theory of linear predictive
analysis is briefly reviewed, because it is the underlying principle of the present
600-bps  voice  digitizer. 


History  of  Formant  Tracking 

The development of the formant-tracking vocoder has had a long and arduous history
since its inception [1]. Its motivation no doubt began with the publication of
Visible Speech [2], which combined a hope of visual speech perception (for the deaf)
with the successful development of the “sound spectrograph” by the Bell Telephone Laboratories.
The fascinating patterns, interpreted phonemically in Visible Speech, combined
with the apparent ease in visually identifying and tracking formants on the sound spectrograph,
led to the construction of a breadboard formant vocoder by Flanagan in 1956 [3],
which gave imperfect yet promising results. Flanagan’s work laid the groundwork for
development work during the late 1950’s and early 1960’s by such diverse organizations
as Northeastern University (Chang [4,5]) with the Formoder and, under government-industrial
contracts, Melpar [5], Philco [6], General Dynamics (Stromberg Carlson Division)
[7], and others [8]. Most of this R&D work was supported by the government for possible
military application and was largely terminated in 1966 when the government decided
to use the older channel vocoders of Homer Dudley [9,10]. At this point it was recognized
that the channel vocoder, although requiring at least twice the bit rate of the formant vocoder,
was somewhat more intelligible and also more highly developed. A joint service effort was
made to procure channel vocoders for the USC-20 program, which ultimately failed and was
canceled after 4 years. But during this time the choice of a channel vocoder for this program
caused further research into formant vocoders to be largely suspended.

In other parts of the world, Sir Walter Lawrence [11] sought support for the formant
vocoder concept with a U.S. tour demonstrating his synthesizer PAT (parametric automatic
talker) driven from formant traces, and Gunnar Fant was using his formant synthesizer OVE
for basic research into speech production, leading to his book Acoustic Theory of Speech
Production [12] in 1970. By this time it was well established that the formant vocoder
concept, though attractive because its implementation would permit a lower bit rate, was
fraught with practical difficulties. In addition to the channel-vocoder problems of pitch
tracking and voicing decision, the formant vocoder had problems of proper formant tracking,
formant identification, formant acquisition for tracking after a silence, and synthesis
problems, particularly in consonant production. Thus potential users of the formant vocoder
became skeptical as to the probable success of this approach for low-bit-rate voice coding.

This  skepticism  is  exemplified  by  Moye  [13],  who  wrote,  “Although  such  a statement  is 
bound  to  be  challenged,  one  can  say  that,  from  the  point  of  view  of  practical  digital  speech 
transmission  systems,  formant  analysis  does  not  work.”  Others  [14,15]  have  expressed 
similar  views. 


Previous  600-bps  Voice  Digitizers 

According to published accounts there have been at least three previous VLDR voice
digitizers. By coincidence they are all 600-bps voice digitizers. Flanagan [16] demonstrated
a formant-tracking vocoder operating at 600 bps and presented his results on the phonograph
record attached to his article. Since the test sentence was composed entirely of vowels,
diphthongs, and liquids (“We were away a year ago”), it was a very limited demonstration of a
600-bps voice digitizer. Nevertheless the synthesized speech was highly articulate, indicating
that the formant-vocoding approach had potential for voice analysis and synthesis. Another
600-bps voice digitizer was developed by Caldwell Smith at the Air Force Cambridge Research
Laboratory [17]. The device employed a pattern-matching technique to classify the channel
vocoder outputs and was the result of extensive R&D work. Its intelligibility score of 92 percent
for a single-talker diagnostic rhyme test (DRT) [18] was an exceptionally high score for
a 600-bps voice system. A third 600-bps system consisted of a modified version of the Melpar
formant vocoder, presented by tape demonstration at the 70th meeting of the Acoustical Society
of America in November 1965.


Summary  of  Linear  Predictive  Analysis 

Because the present 600-bps voice digitizer uses the coded output of a 2400-bps LPE,
the basic principles and mathematical theory of linear predictive analysis are briefly summarized
in the next few pages to facilitate discussions of the 600-bps voice digitizer. Much
of this theory has been well developed in connection with the implementation of voice
digitizers operating at 2400 bps or higher rates [19-25].

In linear predictive analysis a speech sample is represented by a linear combination
of past samples. Thus

$$\hat{x}_t = a_{1|n} x_{t-1} + a_{2|n} x_{t-2} + \cdots + a_{n|n} x_{t-n}, \tag{1}$$

where $x_t$ is a speech sample at time $t$, $a_{j|n}$ is the jth predictive coefficient, and $n$ is the
order of prediction. A set of predictive coefficients is derived by minimizing the
mean-square value of the prediction residual, defined by

$$e_t = x_t - \hat{x}_t. \tag{2}$$

By the application of the classical least-squares method, a set of predictive coefficients
which minimizes the mean-square prediction residual, under the condition of stationarity,
is obtained from

$$\sum_{i=1}^{n} a_{i|n}\,\varphi_{ij} = \varphi_{0j}, \qquad j = 1, 2, \ldots, n, \tag{3}$$

where $\varphi_{ij}$ is the autocorrelation coefficient of the speech signal defined by

$$\varphi_{ij} = \sum_{m=0}^{N-|i-j|} x_m x_{m+|i-j|}, \tag{4}$$

where $N$ is the number of speech samples entered into the correlation analysis. Under the
assumption of stationarity

$$\varphi_{ij} = \varphi_{|i-j|}. \tag{5}$$

Equation (3) is a set of simultaneous linear equations with a doubly symmetric coefficient
matrix (a Toeplitz matrix). The solution of a similar problem has been encountered in
statistics, and its simpler recursive solution is well known [26,27]. The solution of Eq. (3) is

$$a_{i|n} = a_{i|n-1} - a_{n|n}\, a_{n-i|n-1}, \qquad i = 1, 2, \ldots, n-1, \tag{6}$$

where

$$a_{n|n} = \frac{\varphi_n - \sum_{i=1}^{n-1} a_{i|n-1}\,\varphi_{n-i}}{\varphi_0 - \sum_{i=1}^{n-1} a_{i|n-1}\,\varphi_i}, \qquad n = 2, 3, \ldots, \tag{7}$$

and, when $n = 1$,

$$a_{1|1} = \frac{\varphi_1}{\varphi_0}. \tag{8}$$

The  analysis  filter  is  a filter  which  generates  the  prediction  residual  as  it  is  driven  by 
the  speech  signal.  Thus  the  analysis-filter  output  is  the  difference  between  the  given  and 
the  predicted  speech  signals.  Therefore  the  transfer  function  of  the  analysis  filter,  denoted 
by  An(z),  is 


$$A_n(z) = 1 - P_n(z), \tag{9}$$

where $P_n(z)$ is the transfer function of the nth-order predictor. By the z transform of
Eq. (1), $P_n(z)$ is expressed as

$$P_n(z) = a_{1|n} z^{-1} + a_{2|n} z^{-2} + \cdots + a_{n|n} z^{-n}. \tag{10}$$

From Eqs. (9) and (10) the transfer function of the analysis filter becomes

$$A_n(z) = 1 - \left[a_{1|n} z^{-1} + a_{2|n} z^{-2} + \cdots + a_{n|n} z^{-n}\right]. \tag{11}$$

The  structure  of  the  analysis  filter  is  shown  in  Fig.  3a. 
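The recursion of Eqs. (6) through (8) and the analysis filter of Eq. (11) can be sketched in a few lines of Python. This is a minimal illustration rather than the report's implementation: the test signal is a synthetic second-order process chosen for the example, the prediction order of 4 is arbitrary, the autocorrelation is computed over a finite block with zeros assumed outside it, and all function names are hypothetical.

```python
import random

def autocorrelation(x, order):
    """phi[lag] = sum of x[m] * x[m + lag] over the available samples."""
    n = len(x)
    return [sum(x[m] * x[m + lag] for m in range(n - lag)) for lag in range(order + 1)]

def levinson(phi, order):
    """Return (a, k): a[i-1] = a_{i|order} and k[m-1] = a_{m|m}."""
    a, k = [], []
    for n in range(1, order + 1):
        # Eq. (7); for n = 1 both sums are empty and this reduces to Eq. (8).
        num = phi[n] - sum(a[i - 1] * phi[n - i] for i in range(1, n))
        den = phi[0] - sum(a[i - 1] * phi[i] for i in range(1, n))
        k_n = num / den
        # Eq. (6): update the lower-order coefficients, then append a_{n|n}.
        a = [a[i - 1] - k_n * a[n - i - 1] for i in range(1, n)] + [k_n]
        k.append(k_n)
    return a, k

def analysis_filter(x, a):
    """Prediction residual e_t = x_t - sum_i a_i x_{t-i}, zero initial state."""
    return [x[t] - sum(a[i] * x[t - 1 - i] for i in range(len(a)) if t - 1 - i >= 0)
            for t in range(len(x))]

# Synthetic test signal (assumed): a stable second-order process driven by noise.
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(400):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))

ORDER = 4
phi = autocorrelation(x, ORDER)
a, k = levinson(phi, ORDER)
residual = analysis_filter(x, a)
signal_power = sum(v * v for v in x) / len(x)
residual_power = sum(v * v for v in residual) / len(residual)
print(a, residual_power / signal_power)
```

The last coefficient produced at each order is kept in `k` because the same quantities reappear later as the partial correlation coefficients.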


Fig. 3 - Analysis and synthesis filters with predictive coefficients as weights: (a) analysis filter, whose output is the prediction residual

The synthesis filter (the vocal-tract filter) is an inverse of the analysis filter. Thus the
transfer function of the synthesis filter, denoted by $H_n(z)$, is

$$H_n(z) = \frac{1}{A_n(z)} = \frac{1}{1 - \left[a_{1|n} z^{-1} + a_{2|n} z^{-2} + \cdots + a_{n|n} z^{-n}\right]}. \tag{12}$$

Since only the denominator of $H_n(z)$ is a function of the complex variable z, the vocal-tract
filter has only poles. As a result the properties of the vocal-tract filter are entirely
determined by the locations of poles. The vocal-tract filter (Fig. 3b) is structured as a
positive feedback in which a predictor is in the feedback loop. If the vocal-tract filter were
driven by the prediction residual, the synthesized speech would be identical to the given
speech. However, a voice digitizer operating at a bit rate below the speech sampling frequency
uses some form of artificial excitation.
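The statement that the synthesis filter is the inverse of the analysis filter can be illustrated directly. In the Python sketch below, the coefficient values and the test signal are arbitrary assumptions, both filters start from zero initial conditions, and the function names are hypothetical.

```python
import math

def analysis_filter(x, a):
    """Residual e_t = x_t - sum_i a_i x_{t-i}  (Eqs. (1), (2), and (11))."""
    return [x[t] - sum(a[i] * x[t - 1 - i] for i in range(len(a)) if t - 1 - i >= 0)
            for t in range(len(x))]

def synthesis_filter(e, a):
    """All-pole filter of Eq. (12): y_t = e_t + sum_i a_i y_{t-i}."""
    y = []
    for t in range(len(e)):
        feedback = sum(a[i] * y[t - 1 - i] for i in range(len(a)) if t - 1 - i >= 0)
        y.append(e[t] + feedback)       # predictor sits in the feedback loop
    return y

a = [1.3, -0.6]                          # assumed stable coefficient set
x = [math.sin(0.2 * t) + 0.3 * math.cos(0.9 * t) for t in range(50)]

residual = analysis_filter(x, a)
rebuilt = synthesis_filter(residual, a)
max_error = max(abs(u - v) for u, v in zip(x, rebuilt))
print(max_error)
```

Driving the synthesis filter with anything other than the true residual, such as a pulse train or noise, is what the text calls artificial excitation.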

Fig. 3 (Continued) - Analysis and synthesis filters with predictive coefficients as weights: (b) synthesis filter, whose output is the synthesized speech


The last predictive coefficient of each iteration cycle ($a_{n|n}$ expressed by Eq. (7) or (8))
is often referred to as the partial correlation coefficient, denoted by $k_n$:

$$k_n = a_{n|n}. \tag{13}$$

It is possible to construct an analysis and a synthesis filter in which the filter weights are
partial correlation coefficients. From Eqs. (6) and (11) the transfer function of the nth-order
analysis filter in terms of the (n-1)th-order analysis filter is

$$A_n(z) = A_{n-1}(z) - k_n z^{-n} A_{n-1}(z^{-1}). \tag{14}$$

Let

$$B_{n-1}(z) = z^{-n} A_{n-1}(z^{-1}). \tag{15}$$

From Eqs. (14) and (15) $A_n(z)$ in terms of $A_{n-1}(z)$ and $B_{n-1}(z)$ is

$$A_n(z) = A_{n-1}(z) - k_n B_{n-1}(z). \tag{16}$$

Substituting $z^{-1}$ for $z$ in Eq. (14) gives

$$A_n(z^{-1}) = A_{n-1}(z^{-1}) - k_n z^{n} A_{n-1}(z). \tag{17}$$

From Eqs. (15) and (17) $B_n(z)$ in terms of $A_{n-1}(z)$ and $B_{n-1}(z)$ is

$$B_n(z) = z^{-1}\left[B_{n-1}(z) - k_n A_{n-1}(z)\right]. \tag{18}$$

Equations  (16)  and  (18)  define  the  structure  of  the  analysis  filter,  as  shown  in  Fig.  4a. 
The  performance  of  this  cascade-lattice  form  of  an  analysis  filter  is  identical  to  the 
transversal-filter  form  of  an  analysis  filter  shown  in  Fig.  3a. 
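The equivalence of the cascade-lattice and transversal forms can be checked numerically. The Python sketch below applies Eqs. (16) and (18) sample by sample and compares the result with the direct form; the partial correlation values and the test signal are arbitrary assumptions, zero initial conditions are used in both filters, and the function names are hypothetical.

```python
import math

def step_up(k):
    """Predictive coefficients a_{1|n}..a_{n|n} from k_1..k_n via Eq. (6)."""
    a = []
    for n, k_n in enumerate(k, start=1):
        a = [a[i - 1] - k_n * a[n - i - 1] for i in range(1, n)] + [k_n]
    return a

def transversal_residual(x, a):
    """Direct-form residual: e_t = x_t - sum_i a_i x_{t-i}."""
    return [x[t] - sum(a[i] * x[t - 1 - i] for i in range(len(a)) if t - 1 - i >= 0)
            for t in range(len(x))]

def lattice_residual(x, k):
    """Cascade-lattice residual built from Eqs. (16) and (18)."""
    forward = list(x)                    # stage 0: forward branch is x_t
    backward = [0.0] + list(x[:-1])      # stage 0: backward branch is x_{t-1}
    for k_n in k:
        # Eq. (16): new forward output from both branches of the previous stage.
        new_forward = [e - k_n * b for e, b in zip(forward, backward)]
        # Eq. (18): new backward output, delayed by one sample.
        new_backward = [0.0] + [backward[t] - k_n * forward[t]
                                for t in range(len(x) - 1)]
        forward, backward = new_forward, new_backward
    return forward

k = [0.5, -0.3, 0.2]                     # assumed partial correlation coefficients
x = [math.sin(0.2 * t) + 0.3 * math.cos(0.9 * t) for t in range(60)]

direct = transversal_residual(x, step_up(k))
lattice = lattice_residual(x, k)
max_diff = max(abs(u - v) for u, v in zip(direct, lattice))
print(max_diff)
```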


Fig. 4 - Analysis and synthesis filters with partial correlation coefficients as weights: (a) analysis filter, drawn as (n-1) stages followed by the nth stage

The synthesis filter is the inverse of the analysis filter. Thus its transfer function is

$$H_n(z) = \frac{1}{A_n(z)}. \tag{19}$$

Substitution of Eq. (16) into Eq. (19) gives

$$H_n(z) = \frac{\dfrac{1}{A_{n-1}(z)}}{1 - k_n\left[\dfrac{B_{n-1}(z)}{A_{n-1}(z)}\right]}. \tag{20}$$


Figure  4b  shows  the  structure  of  the  vocal-tract  filter  in  which  the  filter  weights  are  partial 
correlation  coefficients.  The  filter  is  a cascade-lattice  network.  The  performance  of  this 
filter  is  identical  to  that  shown  in  Fig.  3b,  provided  the  initial  conditions  for  both  filters 
are  identical.  As  expressed  by  Eq.  (20),  the  last  partial  correlation  coefficient  ( kn ) behaves 
as  the  feedback  gain,  and  the  transfer  function  of  the  quantity  inside  the  bracket  is  that 
of  an  all-pass  filter. 


Fig. 4 (Continued) - Analysis and synthesis filters with partial correlation coefficients as weights: (b) synthesis filter, drawn as the nth stage and (n-1) stages, with synthesized speech as output


A number of significant properties of partial correlation coefficients are the following:

• The vocal-tract filter is stable if each partial correlation coefficient has a magnitude
less than unity [26].

• If the vocal-tract filter is in a cascade-lattice configuration (Fig. 4b), the partial
correlation coefficient can be processed directly at each filter section by the minimization
of its output residual. Thus neither Eqs. (6) through (8) nor the knowledge of speech
correlation coefficients ($\varphi_0, \varphi_1, \ldots$) is required. From Fig. 4a the prediction residual from
the nth-stage output is expressed by

$$e_{n,t} = e_{n-1,t} - k_n\,\epsilon_{n-1,t}, \tag{21}$$

where $e_{n-1,t}$ is the output from the $A_{n-1}(z)$ filter branch (often referred to as the forward
prediction residual) and $\epsilon_{n-1,t}$ is the output from the $B_{n-1}(z)$ filter branch (often referred
to as the backward prediction residual). The partial correlation coefficient which minimizes
the mean-square value of $e_{n,t}$ is

$$k_n = \frac{u_{n-1}}{p_{n-1}}, \tag{22}$$

where

$$u_{n-1} = E(e_{n-1,t}\,\epsilon_{n-1,t}) \tag{23}$$

and

$$p_{n-1} = E(\epsilon_{n-1,t}^2), \tag{24}$$

in which $E(\cdot)$ denotes the expectation operation, which is the time-averaging process in practice.
As can be noted from Eq. (22), a partial correlation coefficient is a power ratio, with
the numerator being a crosscorrelation between the forward and backward prediction residuals
and the denominator being the backward prediction residual power. Under the condition
of stationarity the forward-prediction-residual power equals the backward-prediction-residual
power. Equation (22) is a mathematical equivalent of the previous definition of $k_n = a_{n|n}$
expressed by Eq. (7). Thus Eq. (22) is the basis for computing partial correlation coefficients
directly from the analysis filter.


• The output (forward) prediction residual in terms of the input (forward) prediction
residual at each section may be obtained from Eq. (21). Squaring both sides of Eq. (21)
and passing the resulting quantity through the expectation operation gives

$$p_n = p_{n-1}(1 + k_n^2) - 2 k_n u_{n-1}. \tag{25}$$

From Eq. (22)

$$u_{n-1} = k_n p_{n-1}. \tag{26}$$

From Eqs. (25) and (26) $p_n$ in terms of $p_{n-1}$ is

$$p_n = p_{n-1}(1 - k_n^2). \tag{27}$$

• A set  of  partial  correlation  coefficients  can  be  converted  to  a set  of  predictive 
coefficients  by  the  recursion  relationship  expressed  by  Eq.  (6). 


• Conversely, a set of predictive coefficients can be converted to a set of partial
correlation coefficients by

$$a_{i|n-1} = \frac{a_{i|n} + k_n\, a_{n-i|n}}{1 - k_n^2}, \tag{28}$$

where $i = 1, 2, \ldots, n-1$. This relationship can be derived from the solution of the two
simultaneous equations consisting of Eq. (6) and its mirror-image equation (the equation
in which the index $i$ is replaced by $n-i$).
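The two conversions in the last two properties can be exercised as a round trip. The Python sketch below applies Eq. (6) to go from partial correlation coefficients to predictive coefficients and Eq. (28) to go back; the numerical values and function names are assumptions chosen for illustration.

```python
def step_up(k):
    """k_1..k_n -> a_{1|n}..a_{n|n} by repeated use of Eq. (6)."""
    a = []
    for n, k_n in enumerate(k, start=1):
        a = [a[i - 1] - k_n * a[n - i - 1] for i in range(1, n)] + [k_n]
    return a

def step_down(a):
    """a_{1|n}..a_{n|n} -> k_1..k_n by repeated use of Eq. (28)."""
    a = list(a)
    k = []
    for n in range(len(a), 0, -1):
        k_n = a[n - 1]                   # Eq. (13): k_n is the last coefficient
        k.append(k_n)
        # Eq. (28): recover the (n-1)th-order coefficient set.
        a = [(a[i - 1] + k_n * a[n - i - 1]) / (1.0 - k_n * k_n)
             for i in range(1, n)]
    k.reverse()
    return k

k = [0.5, -0.3, 0.2, 0.1]                # assumed values, each below unity in magnitude
a = step_up(k)
k_back = step_down(a)
round_trip_error = max(abs(u - v) for u, v in zip(k, k_back))
is_stable = all(abs(k_n) < 1.0 for k_n in k_back)
print(a, k_back, is_stable)
```

The step-down direction divides by 1 - k_n^2, so it breaks down when any partial correlation coefficient reaches unit magnitude, which is also the boundary of the stability condition stated in the first property.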


IMPLEMENTATION  OF  THE  600-BPS  VOICE  DIGITIZER 

A technical  overview  of  the  600-bps  voice  digitizer  was  given  in  the  Introduction. 
This  section  discusses  in  detail  how  the  2400-bps-LPE  output  data  can  be  converted  to 
600-bps  data.  The  items  under  discussion  include: 

• Definitions  of  formant  frequency  and  formant  bandwidth, 

• Frequency  response  of  the  vocal-tract  filter, 

• A different  approach  to  formant  tracking, 

• Parameter  coding, 

• Synthesis  filter, 

• Assumptions  on  formant  synthesizer  bandwidth, 

• Excitation  signal  generation,  and 

• Parameter  interpolation. 


Definitions  of  Formant  Frequency  and  Formant  Bandwidth 

Each  pole  of  the  vocal-tract  filter  may  be  represented  by  its  real  and  imaginary  parts, 
or  its  argument  (the  angular  displacement)  and  modulus.  The  formant  frequency  is  linearly 
proportional to the argument of a pole, and the formant bandwidth is logarithmically
proportional to the modulus of a pole.


As was given by Eq. (12), the vocal-tract-filter transfer function is

$$H_n(z) = \frac{1}{1 - \left(a_{1|n} z^{-1} + a_{2|n} z^{-2} + \cdots + a_{n|n} z^{-n}\right)}, \tag{12}$$

which can be rearranged as

$$H_n(z) = \frac{1}{z^{-n}\left(z^{n} - a_{1|n} z^{n-1} - \cdots - a_{n|n}\right)} = \frac{1}{z^{-n}\prod_{i=1}^{n}(z - z_i)}, \tag{29}$$

where $z_i$ is the ith pole of the vocal-tract filter. By the definition of the z-transform variable
each pole can be expressed in terms of its real and imaginary parts. Thus

$$z_i = \exp[(-\mu_i + j\omega_i)\tau] = r_i \exp(j\omega_i\tau), \tag{30}$$

where $r_i$ is the radial distance of the pole $z_i$ defined by

$$r_i = |z_i| = \exp(-\mu_i\tau) \tag{31}$$

and $\tau$ is the sampling period of speech signals. For a stable vocal-tract filter, $\mu_i \ge 0$, which
implies that $r_i \le 1$.

In Eq. (30) $\omega_i$ is a formant frequency in radians per second. Solving for the formant
frequency in Eq. (30) gives

$$f_i = \frac{1}{2\pi\tau}\arg(z_i) \ \text{Hz}. \tag{32}$$

Hence a formant frequency is linearly proportional to the argument of the corresponding
pole.

The pole modulus in Eq. (31) is the envelope decay rate of the vocal-tract-filter impulse
response, and $\mu_i$ is numerically equal to the real part of the ith pole in the complex
s plane (the corner frequency in radians per second). Thus the 3-dB bandwidth of the ith
pole, denoted by $\Delta f_i$, is

$$\Delta f_i = \frac{\mu_i}{\pi} \ \text{Hz}. \tag{33}$$

From Eqs. (31) and (33) the 3-dB bandwidth in terms of the pole modulus is

$$\Delta f_i = -\frac{1}{\pi\tau}\ln(r_i) \ \text{Hz}, \qquad r_i \le 1. \tag{34}$$

Hence, the 3-dB bandwidth of a pole is logarithmically proportional to its modulus.
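Equations (32) and (34) translate directly into a small conversion routine. The Python sketch below is illustrative: the 8000-Hz sampling rate and the 1500-Hz, 100-Hz-bandwidth test pole are assumed values, and the function name is hypothetical.

```python
import cmath
import math

def pole_to_formant(pole, tau):
    """Return (formant frequency, 3-dB bandwidth) in Hz for one pole."""
    freq_hz = cmath.phase(pole) / (2.0 * math.pi * tau)        # Eq. (32)
    bandwidth_hz = -math.log(abs(pole)) / (math.pi * tau)      # Eq. (34)
    return freq_hz, bandwidth_hz

TAU = 1.0 / 8000.0                    # assumed sampling period, seconds
# Build a test pole from Eq. (30) with mu = pi * bandwidth, as in Eq. (33).
mu = math.pi * 100.0
omega = 2.0 * math.pi * 1500.0
pole = cmath.exp(complex(-mu, omega) * TAU)

freq_hz, bandwidth_hz = pole_to_formant(pole, TAU)
print(freq_hz, bandwidth_hz)
```

Only the upper-half-plane member of each complex-conjugate pole pair yields a positive frequency from this routine.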


Frequency  Response  of  the  Vocal-Tract  Filter 

The majority of previously constructed formant estimators were based on the spectral
analysis of speech signals, meaning that the frequency that corresponds to a spectral envelope
peak was regarded as the formant frequency. Although there are subtle differences
[28], essentially similar results may be obtained by the evaluation of the vocal-tract transfer
function along the unit circle in the z plane (the frequency response).

An important point is that some of the poles may not be reflected as peaks in the
frequency response of the vocal-tract filter because of their remote positions from the unit
circle and/or the interference by other poles. Hence a peak-picking process based on the
vocal-tract frequency response often misses formant frequencies. Therefore certain complicated
(and usually ad hoc) procedures are required to track formant frequencies [29].


Nevertheless the frequency response of the vocal-tract filter can be of value if properly
used, as in the present 600-bps voice digitizer. A main advantage of using the frequency-response
function is that it requires only relatively simple, real arithmetic.


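Since the phase response has no intrinsic value for picking peak resonance frequencies,
the power-response function is used [30]:

$$|H_n(\omega)|^2 = \frac{1}{2\pi}\, H_n(z)\, H_n(z^{-1}). \tag{35}$$

Substituting Eq. (12) into Eq. (35) and letting $z = \exp(j\omega\tau)$ gives

$$|H_n(\omega)|^2 = \frac{1}{2\pi}\,\frac{1}{A_0 + 2\sum_{i=1}^{n} A_i \cos(i\omega\tau)}, \tag{36}$$

where

$$\begin{aligned}
A_0 &= 1 + a_{1|n}^2 + a_{2|n}^2 + \cdots + a_{n|n}^2, \\
A_1 &= -a_{1|n} + a_{1|n}a_{2|n} + a_{2|n}a_{3|n} + \cdots + a_{n-1|n}a_{n|n}, \\
A_2 &= -a_{2|n} + a_{1|n}a_{3|n} + a_{2|n}a_{4|n} + \cdots + a_{n-2|n}a_{n|n}, \\
&\;\;\vdots \\
A_n &= -a_{n|n}.
\end{aligned}$$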

As expressed by Eq. (36), the frequency (power) response of the vocal-tract filter is the reciprocal of a Fourier cosine series whose expansion coefficients are the autocorrelation coefficients of the analysis-filter impulse response expressed by Eq. (11). The resonance frequencies of the vocal-tract filter correspond to the frequencies at which the denominator of Eq. (36) exhibits local minima. The search for local minima may be carried out by evaluating the denominator at discretely selected frequencies. The term $\cos(i\omega\tau)$ may be stored as a set of constants to facilitate the computations. If the speech sampling rate is 8000 Hz and the desired frequency resolution is 50 Hz, the term $\cos(i\omega\tau)$ takes only 41 distinct magnitudes (the values of $\cos(2\pi m/160)$ for $m = 0, 1, \ldots, 40$), apart from sign.
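The evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`autocorr_coeffs`, `denominator`, `resonances_hz`) and the second-order test filter are hypothetical and do not come from the report, and a 50-Hz grid at an 8000-Hz sampling rate is assumed. The sketch forms the coefficients of Eq. (37), evaluates the denominator of Eq. (36) on the grid, and reports the interior local minima.

```python
import cmath
import math

def autocorr_coeffs(a):
    """A_0..A_n of Eq. (37) from predictive coefficients a = [a_1, ..., a_n]."""
    c = [1.0] + [-x for x in a]  # analysis-filter impulse response: 1, -a_1, ..., -a_n
    n = len(a)
    return [sum(c[j] * c[j + i] for j in range(n + 1 - i)) for i in range(n + 1)]

def denominator(A, omega_tau):
    """Denominator of Eq. (36) at normalized frequency omega*tau (radians)."""
    return A[0] + 2.0 * sum(A[i] * math.cos(i * omega_tau) for i in range(1, len(A)))

def resonances_hz(a, fs=8000.0, step_hz=50.0):
    """Grid frequencies at which the denominator has an interior local minimum."""
    A = autocorr_coeffs(a)
    n_pts = int(round(fs / 2.0 / step_hz)) + 1
    d = [denominator(A, 2.0 * math.pi * k * step_hz / fs) for k in range(n_pts)]
    return [k * step_hz for k in range(1, n_pts - 1)
            if d[k] < d[k - 1] and d[k] <= d[k + 1]]

# Hypothetical test filter (not from the report): one pole pair of modulus 0.95 at 1000 Hz.
r, f0, fs = 0.95, 1000.0, 8000.0
a_demo = [2.0 * r * math.cos(2.0 * math.pi * f0 / fs), -r * r]
found = resonances_hz(a_demo)

# Cross-check: the cosine series must equal |A(e^{j omega tau})|^2 computed directly.
w = 0.3
direct = abs(1.0 - sum(ak * cmath.exp(-1j * w * (i + 1))
                       for i, ak in enumerate(a_demo))) ** 2
via_eq36 = denominator(autocorr_coeffs(a_demo), w)

# Distinct magnitudes of cos(2*pi*m/160), the cosine table for a 50-Hz grid.
magnitudes = {round(abs(math.cos(2.0 * math.pi * m / 160.0)), 6) for m in range(160)}
```

Only real multiplications and a stored cosine table are needed in `denominator`, which is the practical advantage the text attributes to the frequency-response approach.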

Figure 5 illustrates the vocal-tract-filter frequency response computed from the spoken words “happy hour.” Each trace represents the spectral intensities from 0 to 4000 Hz, and the trace is renewed every 25 milliseconds. As expected, unvoiced segments (/h/ and /p/) do not exhibit sharp resonance peaks, but vowels produce three to four recognizable resonance peaks. Despite the simplicity of computation, the direct use of the frequency-response function does not lead to successful formant tracking.

Fig. 5 — Frequency response of the vocal-tract filter estimated from actual speech signals


A Different  Approach  to  Formant  Tracking 

Formant  tracking  is  a process  of  estimating  formant  frequencies,  and  logging  each  into 
a designated  tracker  from  frame  to  frame.  Assignment  of  formant  values  to  a particular 
track  or  formant  number  is  required  because  each  formant  frequency  must  be  interpolated 
during  speech  synthesis. 

Formant tracking becomes less of a problem if the estimated formant frequencies are naturally ordered and rarely drop out. Then the lowest formant frequency simply becomes $f_1$, the next one becomes $f_2$, and so on. The present 600-bps voice digitizer requires reliable formant extraction, because formant frequencies are estimated only once per frame; hence the dynamics of formant frequencies during the intraframe period are not available. Any kind of ad hoc rules or other “dead-reckoning” schemes to fill in missing formant frequencies and/or to rearrange erroneously ordered formant frequencies are virtually unworkable in practice, because of the many exceptions that arise.

The 600-bps voice digitizer employs a somewhat unconventional formant extraction method which not only provides sure acquisition but also maintains a naturally ordered formant sequence. The method proceeds in two steps: the estimation of initial (and approximate) formant frequencies, and subsequent refinements by an iterative technique.

The first step of the operation moves all the poles of the vocal-tract filter toward the unit circle in the z plane. This is accomplished by simply letting the last predictive coefficient (which is numerically equal to the last partial correlation coefficient) be near unity. Once the poles are near the unit circle, the frequency response of the vocal-tract filter exhibits extremely sharp resonance peaks. These resonance frequencies serve as the initial iteration points to be subsequently refined by the second step of the operation. The initial resonance frequencies are approximate because the poles do not move radially as the last predictive coefficient approaches unity.

Figure 6 shows a set of vocal-tract-filter frequency responses in which the last predictive coefficient was successively varied from its actual value to near unity. The following vocal-tract-filter parameters were derived from a voiced segment of actual speech: $k_1 = 0.860$, $k_2 = -0.818$, $k_3 = -0.252$, $k_4 = 0.311$, $k_5 = 0.204$, $k_6 = 0.054$, $k_7 = 0.215$, $k_8 = -0.339$, $k_9 = 0.445$, and $k_{10} = 0.005$ (which will be varied).

In Fig. 6 the hidden second formant frequency in the original vocal-tract-filter response gradually became visible as the last predictive coefficient approached unity. This phenomenon may be explained from the following three different points of view:


• Algebraic point of view. From Eq. (29) the poles of the vocal-tract filter in terms of the predictive coefficients are given by

$$z^n - a_{1|n} z^{n-1} - a_{2|n} z^{n-2} - \cdots - a_{n|n} = \prod_{i=1}^{n} (z - z_i). \tag{38}$$

Thus the magnitude of the last predictive coefficient ($a_{n|n} = k_n$) is the product of all pole moduli of the vocal-tract filter. Because every pole of a stable filter has a modulus less than one, making the product near unity forces each individual pole modulus to be near unity, signifying that all poles are near the unit circle in the z plane.

• Control theory point of view. The transfer function of the vocal-tract filter in terms of the partial correlation coefficients is given by Eq. (20) as

$$H_n(z) = \frac{1}{A_{n-1}(z)} \cdot \frac{1}{1 - k_n \left[ \dfrac{z^{-n} A_{n-1}(z^{-1})}{A_{n-1}(z)} \right]}, \tag{20}$$

where $k_n$ is the nth partial correlation coefficient and $A_{n-1}(z)$ is the (n-1)th-order analysis-filter transfer function. The vocal-tract filter, as expressed by Eq. (20), is a positive-feedback network in which $k_n$ behaves as a feedback gain. Since the quantity inside the brackets is a unity-gain, all-pass filter (its magnitude response is independent of frequency), the loop gain is determined solely by $k_n$. As $k_n$ approaches unity, the poles migrate toward the unit circle. The trajectory of the poles as a function of the feedback gain is known as the root locus [31]. The root locus for the vocal-tract-filter parameters used to construct Fig. 6 is plotted in Fig. 7. As shown, the poles do not move radially, which means that the initial formant frequency estimates contain errors; these are corrected by the second step of the operation.




• Acoustic point of view. If the effects of the lung and nasal cavities are omitted, the vocal tract is closely approximated by cascaded concentric pipes, each having equal length L but different cross-sectional areas $A_1, A_2, \ldots$. The reflection coefficient, denoted by $\rho_n$, is defined as the ratio of the difference to the sum of two adjacent areas. Thus

$$\rho_n = \frac{A_{n+1} - A_n}{A_{n+1} + A_n}. \tag{39}$$

It has been established that a partial correlation coefficient equals the negative of a reflection coefficient [32]. Hence the approach of $k_n$ to unity implies a complete reflection at one end of the vocal-tract filter (a lossless case). Thus its resonance peaks have infinitesimally small bandwidths.
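The effect described from these three points of view can be tried numerically. The following is a minimal sketch, assuming the usual step-up recursion $a_{i|m} = a_{i|m-1} - k_m a_{m-i|m-1}$ with $a_{m|m} = k_m$ (the sign convention implied by Eq. (20) as written here); the function names `step_up` and `peak_power` are hypothetical. It converts the partial correlation coefficients quoted for Fig. 6 into predictive coefficients while the last coefficient is varied, and reports the largest value of the power response on a 50-Hz grid.

```python
import cmath
import math

# Partial correlation coefficients quoted in the text for Fig. 6
# (the last one, k_10, is the coefficient to be varied).
K_FIG6 = [0.860, -0.818, -0.252, 0.311, 0.204, 0.054, 0.215, -0.339, 0.445, 0.005]

def step_up(k):
    """PARCORs [k_1..k_n] -> predictive coefficients [a_1|n..a_n|n].

    Assumed recursion: a_{i|m} = a_{i|m-1} - k_m * a_{m-i|m-1}, a_{m|m} = k_m.
    """
    a = []
    for m, km in enumerate(k, start=1):
        a = [a[i] - km * a[m - 2 - i] for i in range(m - 1)] + [km]
    return a

def peak_power(a, fs=8000.0, step_hz=50.0):
    """Largest value of 1/|A(e^{j omega tau})|^2 on a uniform grid from 0 to fs/2."""
    best = 0.0
    for n in range(int(round(fs / 2.0 / step_hz)) + 1):
        w = 2.0 * math.pi * n * step_hz / fs
        A = 1.0 - sum(ai * cmath.exp(-1j * w * (i + 1)) for i, ai in enumerate(a))
        best = max(best, 1.0 / max(abs(A) ** 2, 1e-30))
    return best

# Vary the last coefficient from its actual value toward unity.
for k10 in (0.005, 0.5, 0.9):
    a = step_up(K_FIG6[:-1] + [k10])
    print(f"k10 = {k10:5.3f}  a_10|10 = {a[-1]:5.3f}  peak power = {peak_power(a):.1f}")
```

The recursion makes the last predictive coefficient equal to the last partial correlation coefficient by construction, which is the identity used in the algebraic argument.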






Figure 8 exemplifies the effectiveness of the first step of this operation. Figure 8a is a plot of the formant frequencies derived from actual speech samples through the use of Eq. (36). As shown, the lack of continuity makes formant tracking almost impossible. Figure 8b is the result of the first step of this operation by the 600-bps voice digitizer. All formant frequencies are always present, and they are well ordered and separated.

The second step of this operation refines these initial formant estimates. As the last predictive coefficient moves toward its actual value, the frequency response is recomputed over a small range around the previous formant estimate. The theoretical range of the last coefficient is between 1 and -1. However actual speech samples show that the last coefficient lies somewhere between 0.5 and -0.5. A few iterations with an incremental step of $\Delta k_n = -0.2$ will find substantially accurate formant frequencies. If a formant frequency disappears during this iteration cycle, the previous value is retained.
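The two-step procedure can be sketched as follows. This is an illustrative reading of the text, not the report's implementation: the starting value of 0.9 for the last coefficient, the 50-Hz search grid, the 200-Hz search window, and all function names are assumptions, and the step-up recursion uses the sign convention $a_{i|m} = a_{i|m-1} - k_m a_{m-i|m-1}$.

```python
import cmath
import math

FS = 8000.0  # sampling rate in Hz, as in the text
K_FIG6 = [0.860, -0.818, -0.252, 0.311, 0.204, 0.054, 0.215, -0.339, 0.445, 0.005]

def step_up(k):
    """PARCORs -> predictive coefficients (assumed sign convention, a_{m|m} = k_m)."""
    a = []
    for m, km in enumerate(k, start=1):
        a = [a[i] - km * a[m - 2 - i] for i in range(m - 1)] + [km]
    return a

def response(a, f_hz):
    """Power response 1/|A(e^{j omega tau})|^2 at frequency f_hz."""
    w = 2.0 * math.pi * f_hz / FS
    A = 1.0 - sum(ai * cmath.exp(-1j * w * (i + 1)) for i, ai in enumerate(a))
    return 1.0 / max(abs(A) ** 2, 1e-30)

def peaks(a, lo, hi, step):
    """Interior local maxima of the response on [lo, hi], sampled every `step` Hz."""
    n = int(round((hi - lo) / step))
    f = [lo + i * step for i in range(n + 1)]
    p = [response(a, x) for x in f]
    return [f[i] for i in range(1, n) if p[i] > p[i - 1] and p[i] >= p[i + 1]]

def track_formants(k, k_start=0.9, dk=-0.2, step=50.0, span=200.0):
    """Step 1: pick peaks with the last PARCOR near unity (k_start is an assumption).
    Step 2: move the last PARCOR back to its true value in steps of dk,
    re-searching a small window around each previous estimate."""
    k_true = k[-1]
    kn = k_start
    est = peaks(step_up(k[:-1] + [kn]), 0.0, FS / 2.0, step)
    while kn > k_true:
        kn = max(kn + dk, k_true)
        a = step_up(k[:-1] + [kn])
        refined = []
        for f0 in est:
            local = peaks(a, max(f0 - span, 0.0), min(f0 + span, FS / 2.0), step)
            # Keep the previous value if the peak has disappeared.
            refined.append(min(local, key=lambda f: abs(f - f0)) if local else f0)
        est = refined
    return est

formants = track_formants(K_FIG6)
print(formants)
```

Because each estimate is refined only within its own window and is retained when its peak vanishes, the number of formant tracks fixed in the first step is preserved through the second step.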


Parameter  Coding 


In  a manner  similar  to  a 2400-bps  LPE,  the  600-bps  voice  digitizer  transmits  two  sets 
of  speech  parameters:  the  vocal-tract-filter  parameters  and  the  excitation  parameters.  The 
excitation  parameters  include  the  pitch  period,  the  excitation  power  level,  and  the  voice/ 
unvoice  decision.  The  vocal-tract-filter  parameters  take  one  of  two  forms  depending  on 
the  voicing  state:  the  formant  frequencies  for  voiced  sounds  and  the  predictive  coefficients 
for  unvoiced  sounds. 

The  parameter-update  rate  was  chosen  as  40  Hz,  which  is  10  percent  slower  than  that 
of  a 2400-bps  LPE,  due  to  the  data-rate  limitation.  Thus  the  number  of  bits  per  frame 
equals  15  for  this  frame  rate  of  40  Hz. 

These  15  bits  could  be  allocated  in  the  following  manner:  one  bit  per  frame  for 
synchronization,  one  bit  per  frame  for  the  voice/unvoice  decision,  four  bits  per  frame  for 
the  amplitude  information  in  order  to  encompass  the  dynamic  range  of  speech  encountered 
in  normal  conversation,  and  the  last  nine  bits  per  frame  for  the  vocal-tract-filter  parameters 
and  the  pitch  information.  However  the  pitch  period,  even  though  it  is  a rather  important 
parameter  for  the  reproduction  of  more  natural  speech,  possesses  a contour  which  does  not 
vary  as  rapidly  as  other  speech  parameters  in  normal  conversation.  Thus  pitch  information 
can  be  transmitted  once  every  other  frame  without  causing  undue  mechanical  inflection  in 
the  synthesized  speech,  and  it  is  quantized  to  five  bits  logarithmically  from  50  to  300  Hz 
(12  steps  per  octave).  The  upper  cutoff  frequency  of  300  Hz  is  somewhat  lower  than 
might  be  desired,  but  this  is  a compromise  for  the  600-bps  voice  digitizer. 
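As a small worked example of the pitch coding just described, the sketch below quantizes a pitch frequency logarithmically with 12 steps per octave starting at 50 Hz and clamps the result to five bits. The function names are hypothetical, and rounding to the nearest step is an assumption; the report does not state the rounding rule.

```python
import math

F_MIN_HZ = 50.0          # lowest coded pitch frequency
STEPS_PER_OCTAVE = 12    # logarithmic step size
N_BITS = 5               # bits available for pitch

def encode_pitch(f_hz):
    """Return the 5-bit code (0..31) for a pitch frequency in Hz."""
    q = round(STEPS_PER_OCTAVE * math.log2(f_hz / F_MIN_HZ))
    return min(max(q, 0), 2 ** N_BITS - 1)

def decode_pitch(code):
    """Return the pitch frequency in Hz represented by a code."""
    return F_MIN_HZ * 2.0 ** (code / STEPS_PER_OCTAVE)

codes = [encode_pitch(f) for f in (50.0, 100.0, 200.0, 300.0)]
top_hz = decode_pitch(2 ** N_BITS - 1)   # highest representable pitch
```

The span from 50 to 300 Hz is about 2.58 octaves, or about 31 steps of one-twelfth octave, which is why 32 levels (five bits) suffice.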

Since  the  pitch  information  is  transmitted  once  every  other  frame,  it  is  necessary  to 
group  two  frames  in  one.  Therefore  only  one  synchronization  bit  is  required  for  every 
two  frames,  and  the  number  of  bits  available  to  code  vocal-tract-filter  parameters  becomes 
seven.  Table  1 shows  a comparison  in  bit  assignments  between  a typical  2400-bps  LPE 
and  the  600-bps  voice  digitizer. 
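The bit budget above can be checked with simple arithmetic. The sketch below tallies the bits of one double frame for the 600-bps voice digitizer and, for comparison, one frame of a typical 2400-bps LPE; the dictionary keys are descriptive labels chosen here, and the 44.444-Hz frame rate is taken as 400/9 Hz, which is an assumption about the rounded figure.

```python
# Bits per double frame (two 25-ms frames) for the 600-bps voice digitizer.
DOUBLE_FRAME_600 = {
    "synchronization": 1,               # once per double frame
    "voice/unvoice decision": 2 * 1,    # one bit in each frame
    "amplitude": 2 * 4,                 # four bits in each frame
    "pitch": 5,                         # once per double frame
    "vocal-tract filter": 2 * 7,        # seven bits in each frame
}
FRAME_RATE_600_HZ = 40

bits_600 = sum(DOUBLE_FRAME_600.values())
rate_600 = bits_600 * FRAME_RATE_600_HZ / 2      # double frames occur at 20 Hz

# Bits per frame for a typical 2400-bps LPE.
FRAME_2400 = {
    "synchronization": 1,
    "voice/unvoice decision": 1,
    "amplitude": 6,
    "pitch": 6,
    "vocal-tract filter": 40,
}
bits_2400 = sum(FRAME_2400.values())
rate_2400 = bits_2400 * 400 / 9                  # 44.444 Hz taken as 400/9 Hz

print(bits_600, rate_600, bits_2400, rate_2400)
```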

The vocal-tract-filter parameters control the spectral shape or tone color of the synthesized speech. A 2400-bps LPE transmits 40 bits describing the vocal-tract-filter parameters, but the 600-bps voice digitizer transmits only seven bits. The reduction from 40 bits to seven bits is tantamount to a reduction from approximately 1 trillion tone colors ($2^{40}$) to merely 128 ($2^7$). Therefore the 600-bps voice digitizer must use the seven bits in the most effective way.

Table 1 — Parameter Coding

Parameter                        Typical 2400-Bit-Per-Second    600-Bit-Per-Second
                                 Linear Predictive Encoder      Voice Digitizer

Frame rate                       44.444 Hz                      40 Hz
Vocal-tract-filter parameters    40 bits/frame                  7 bits/frame
Excitation parameters
  Voice/unvoice decision         1 bit/frame                    1 bit/frame
  Amplitude                      6 bits/frame                   4 bits/frame
  Pitch                          6 bits/frame                   5 bits/double frame
Synchronization                  1 bit/frame                    1 bit/double frame
Total number of bits             54 bits/frame                  30 bits/double frame

For  voiced  sounds  the  vocal-tract  filter  is  well  characterized  by  three  formant  frequen- 
cies. To  conserve  the  data  rate,  neither  formant  bandwidths  nor  formant  intensities  are 
transmitted.  On  the  other  hand  the  vocal-tract  filter  for  unvoiced  sounds  is  poorly  charac- 
terized in  terms  of  the  formant  frequencies.  This  is  because  the  majority  of  unvoiced 
speech  spectra  are  broader  and  lack  sharp  resonance  peaks.  Consequently  six  partial  cor- 
relation coefficients  are  transmitted  for  unvoiced  sounds  rather  than  the  three  formant 
frequencies. 

At this point the question is how to code the three formant frequencies or six partial correlation coefficients so that the total number of bits per frame will not exceed seven. If each formant frequency were quantized independently, at least ten bits would be required for good speech synthesis (Table 2). Since ten bits exceeds the transmission capacity, an alternative approach was sought.


Table 2 — Formant Frequency Coding If the Formant Frequencies Were Quantized Independently

Formant Frequency    Range (Hz)      Number of Bits

$f_1$                150 to 1000     3
$f_2$                700 to 2500     4
$f_3$                1600 to 3100    3

In this new approach formant frequencies are not quantized independently, because they are mutually dependent ($f_3$ may be predicted from $f_1$ and $f_2$ for most of the vowels) and certain combinations of formant frequencies do not occur in any given language. That is, formant frequencies are highly grouped, as shown in Fig. 9 [33]. Thus the most effective coding may be achieved by considering all formant frequencies jointly. This argument has led to a pattern-matching approach to formant-frequency coding.


Fig.  9 — Mean  formant  frequencies  for  33  men  uttering  the  English  vowels.  (After 
Peterson  and  Barney  [33].)  (Words  which  would  contain  the  vowel  sounds  are 
given  in  parentheses.) 


To select 128 reference formant patterns, over 10,000 formant frequencies were collected from male and female subjects. These formant frequencies were classified into 128 patterns in such a manner that the weighted Euclidean distance between any two reference patterns was greater than a prescribed value (R):

$$\sum_{i=1}^{3} \left[ (f_{i,m} - f_{i,j})\, w_i \right]^2 > R^2, \qquad m, j = 1, 2, \ldots, 128, \quad m \neq j, \tag{40}$$

where $f_{i,m}$ is the ith formant frequency (i = 1, 2, and 3) of the mth pattern and $w_i$ is the weighting factor for the ith formant frequency.

These weighting factors emphasize the most important formant frequencies from a perceptual viewpoint. For example, among the first three formant frequencies, $f_3$ is the least important. This is demonstrated in that synthesized speech is intelligible in most cases with $f_1$ and $f_2$ only. Notable exceptions are /r/ and /l/, which cannot reliably be distinguished by $f_1$ and $f_2$ alone. Although both $f_1$ and $f_2$ are important, it has been found that $f_1$ should be weighted more heavily, mainly because the level of $f_1$ is more constant and errors or fluctuations in its values are more obvious to the human ear. Thus the weighting factors were chosen as $w_1 = 3$, $w_2 = 2$, and $w_3 = 1$. The magnitude of R was selected experimentally to be 400 Hz.
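One way to obtain reference patterns that satisfy Eq. (40) is a greedy scan over the collected formant sets, keeping a set only if it lies farther than R from every pattern kept so far. The report does not say how the classification was actually carried out, so the procedure below is an assumption offered purely for illustration; the function names and the sample formant sets are hypothetical.

```python
import math

WEIGHTS = (3.0, 2.0, 1.0)   # w_1, w_2, w_3 from the text
R_HZ = 400.0                # minimum weighted distance from the text

def weighted_distance(p, q, w=WEIGHTS):
    """Weighted Euclidean distance of Eq. (40) between two formant triples (Hz)."""
    return math.sqrt(sum(((pi - qi) * wi) ** 2 for pi, qi, wi in zip(p, q, w)))

def select_references(samples, r=R_HZ, max_patterns=128):
    """Greedy selection (an assumed procedure): keep a sample only if it is
    farther than r from every reference kept so far."""
    refs = []
    for s in samples:
        if len(refs) == max_patterns:
            break
        if all(weighted_distance(s, ref) > r for ref in refs):
            refs.append(tuple(s))
    return refs

# Hypothetical formant sets (f1, f2, f3) in Hz, for illustration only.
samples = [(300.0, 2300.0, 3000.0),
           (310.0, 2290.0, 3010.0),   # too close to the first set, so it is skipped
           (700.0, 1200.0, 2500.0)]
references = select_references(samples)
```

Any pair of patterns kept by this scan satisfies the separation condition of Eq. (40) by construction, although a greedy scan does not guarantee the best coverage of the formant space.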


At the transmitter each observed formant-frequency set is compared with the stored reference formant-frequency patterns. The pattern is selected by the minimum-distance criterion:

$$\min_{m} \sum_{i=1}^{3} \left[ (f_i - f_{i,m})\, w_i \right]^2, \tag{41}$$

where $f_i$ is the observed ith formant frequency. The code to be transmitted is simply the index of the chosen reference formant set.
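The transmitter-side search of Eq. (41) amounts to a nearest-neighbor lookup with the weighted distance. A minimal sketch follows; the three-entry codebook is hypothetical and stands in for the 128 stored patterns, and the function names are chosen here for illustration.

```python
WEIGHTS = (3.0, 2.0, 1.0)   # w_1, w_2, w_3 from the text

# Hypothetical reference patterns (f1, f2, f3) in Hz; the real table has 128 entries.
REFERENCES = [
    (270.0, 2290.0, 3010.0),
    (730.0, 1090.0, 2440.0),
    (300.0, 870.0, 2240.0),
]

def encode_formants(observed, references=REFERENCES, w=WEIGHTS):
    """Return the index m that minimizes the weighted squared distance of Eq. (41)."""
    def cost(m):
        return sum(((f - fr) * wi) ** 2 for f, fr, wi in zip(observed, references[m], w))
    return min(range(len(references)), key=cost)

def decode_formants(index, references=REFERENCES):
    """Receiver side: look the transmitted index up in the same table."""
    return references[index]

code = encode_formants((280.0, 2250.0, 2900.0))
recovered = decode_formants(code)
```

With 128 patterns the index fits in seven bits, and the receiver needs only the same table to recover the formant set.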


Similar  procedures  are  applied  to  classify  six  predictive  coefficients  stemming  from 
unvoiced  sounds.  For  unvoiced  sounds  description  of  the  vocal-tract  filter  need  not  be 
precise.  An  illustration  of  this  point  is  that  when  Fransen  of  NRL  [34]  previously  applied 
a pattern-matching  technique  to  classify  predictive  coefficients  for  both  voiced  and  unvoiced 
sounds,  the  method  generated  high-quality  speech  at  1200  bps,  with  a diagnostic-rhyme-test 
(DRT)  intelligibility  score  of  88  percent. 


Synthesis Filter

The  synthesis  filter  may  take  many  different  forms:  narrowband  filters  in  series,  nar- 
rowband filters  in  parallel,  a transversal  filter,  or  a cascade-lattice  filter.  Although  the  use 
of  narrowband  filters  is  simple,  a cascade-lattice  filter  was  used  as  the  synthesizer  for  this 
system  because  of  the  following  advantages: 

• The  transmitted  vocal-tract-filter  parameters  for  unvoiced  sounds  (partial  correlation 
coefficients)  can  be  used  directly  as  filter  weights. 

• The necessary excitation power level which produces the synthesized power equal to the input speech power is obtained by a simple relationship:

$$P_n = \prod_{i=1}^{n} \left( 1 - k_i^2 \right) P_I, \tag{42}$$

where $P_n$ and $P_I$ are the excitation power for the synthesizer and the input signal power respectively. Equation (42) is a direct consequence of Eq. (27).

• The intensity of the individual formant frequency is automatically weighted by the mutual locations of the poles (as in the serial analog vocal tract using narrowband filters).

• The cascade-lattice synthesis filter was already available in the Navy experimental 2400-bps LPE.

For voiced sounds the formant-frequency information must be converted to predictive coefficients. The transfer function of a filter having three pairs of complex-conjugate poles is

$$T(z) = \frac{1}{\displaystyle\prod_{i=1}^{3} \left( 1 - 2 e^{-p_i \tau} \cos(\omega_i \tau)\, z^{-1} + e^{-2 p_i \tau} z^{-2} \right)}, \tag{43}$$

where $\omega_i$ is the ith formant frequency in radians per second and the factor $p_i$ is related to the pole modulus as indicated by Eq. (31); and the transfer function of a sixth-order vocal-tract filter in terms of predictive coefficients is

$$H_6(z) = \frac{1}{1 - a_{1|6} z^{-1} - a_{2|6} z^{-2} - \cdots - a_{6|6} z^{-6}}. \tag{44}$$
Comparison of Eqs. (43) and (44) term by term gives a set of predictive coefficients in terms of formant frequencies and the pole moduli (related to formant bandwidths). Thus

$$\begin{aligned}
a_{1|6} &= -B_1 - B_2 - B_3, \\
a_{2|6} &= -r_1 - r_2 - r_3 - B_1B_2 - B_1B_3 - B_2B_3, \\
a_{3|6} &= -(B_2 + B_3)r_1 - (B_1 + B_3)r_2 - (B_1 + B_2)r_3 - B_1B_2B_3, \\
a_{4|6} &= -r_1r_2 - r_1r_3 - r_2r_3 - r_1B_2B_3 - r_2B_1B_3 - r_3B_1B_2, \\
a_{5|6} &= -r_1r_2B_3 - r_1r_3B_2 - r_2r_3B_1, \\
a_{6|6} &= -r_1r_2r_3,
\end{aligned} \tag{45}$$

where $B_i$ is a simplified notation for

$$B_i = -2 e^{-p_i \tau} \cos \omega_i \tau, \qquad i = 1, 2, \text{ or } 3, \tag{46}$$

and $r_i$, as defined by Eq. (33), is set by the ith-pole modulus; term-by-term agreement with Eq. (43) requires $r_i = e^{-2 p_i \tau}$, the square of the modulus $e^{-p_i \tau}$. The relationship between $e^{-p_i \tau}$ and the 3-dB formant bandwidth is expressed by Eq. (34). Finally the set of predictive coefficients can be converted to a set of partial correlation coefficients through the use of Eq. (28).
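The conversion chain for voiced sounds (formant frequencies to predictive coefficients to partial correlation coefficients, followed by the excitation power scaling of Eq. (42)) can be sketched as follows. Several points are assumptions: the pole modulus is taken as $e^{-\pi \Delta f \tau}$ for a 3-dB bandwidth $\Delta f$ (a common relation, standing in for Eq. (34), which is not reproduced in this section); the step-down recursion is the inverse of $a_{i|m} = a_{i|m-1} - k_m a_{m-i|m-1}$ (standing in for Eq. (28)); and all function names and the example formant values are hypothetical.

```python
import math

FS = 8000.0  # sampling rate in Hz

def formants_to_predictor(formants_hz, bandwidths_hz, fs=FS):
    """Multiply out the second-order sections of Eq. (43) and return [a_1..a_2N]."""
    poly = [1.0]                                   # coefficients of z^0, z^-1, ...
    for f, bw in zip(formants_hz, bandwidths_hz):
        rho = math.exp(-math.pi * bw / fs)         # ASSUMED pole modulus for bandwidth bw
        B = -2.0 * rho * math.cos(2.0 * math.pi * f / fs)   # Eq. (46)
        section = [1.0, B, rho * rho]              # 1 + B z^-1 + r z^-2
        new = [0.0] * (len(poly) + 2)
        for i, p in enumerate(poly):
            for j, s in enumerate(section):
                new[i + j] += p * s
        poly = new
    return [-c for c in poly[1:]]                  # H(z) = 1 / (1 - sum a_i z^-i)

def step_down(a):
    """Predictive coefficients -> PARCORs (assumed inverse of the step-up recursion)."""
    a = list(a)
    k = []
    for m in range(len(a), 0, -1):
        km = a[m - 1]
        k.append(km)
        d = 1.0 - km * km
        a = [(a[i] + km * a[m - 2 - i]) / d for i in range(m - 1)]
    return k[::-1]

def excitation_gain(k):
    """Ratio P_n / P_I of Eq. (42): product of (1 - k_i^2)."""
    g = 1.0
    for ki in k:
        g *= 1.0 - ki * ki
    return g

# Hypothetical voiced frame: formants in Hz with fixed bandwidths of 50, 60, and 80 Hz.
a6 = formants_to_predictor([500.0, 1500.0, 2500.0], [50.0, 60.0, 80.0])
k6 = step_down(a6)
gain = excitation_gain(k6)
```

Multiplying the sections numerically gives the same coefficients as the closed forms of Eq. (45) and extends unchanged to a different number of formants.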

Formant-Bandwidth  Assumptions 

Formant bandwidths depend not only on the respective formant frequencies [35] (Fig. 10) but also on the individual quality of a particular voice. However formant bandwidths are not too critical to speech intelligibility. Therefore the formant bandwidths may be approximately assigned in accordance with the formant frequencies, if the individual quality is not too important. Examples of workable assumptions are

$$\Delta f_i = \begin{cases} 50 \text{ Hz}, & \text{if } f_i \le 2000 \text{ Hz}, \\ 50 + 0.1\,(f_i - 2000) \text{ Hz}, & \text{if } f_i > 2000 \text{ Hz}, \end{cases} \tag{47}$$

or fixed values for each formant such as $\Delta f_1 = 50$ Hz, $\Delta f_2 = 60$ Hz, and $\Delta f_3 = 80$ Hz.

Fig. 10 — Formant bandwidth as a function of formant frequency under conditions of a closed glottis. (After Fant [35].)
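The frequency-dependent rule above is a one-line function; the sketch below states it in Python with a hypothetical function name, using the constants 50 Hz, 0.1, and 2000 Hz from the rule as given.

```python
def formant_bandwidth_hz(f_hz):
    """Bandwidth assigned to a formant at f_hz: constant up to 2000 Hz, then rising."""
    if f_hz <= 2000.0:
        return 50.0
    return 50.0 + 0.1 * (f_hz - 2000.0)

# Bandwidths for a hypothetical formant set (Hz).
bandwidths = [formant_bandwidth_hz(f) for f in (500.0, 1500.0, 3000.0)]
```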


Excitation  Signal  Generation 

The  nature  of  the  excitation  signal  is  virtually  identical  to  that  used  for  a 2400-bps 
LPE  (a  pulse  train  for  voiced  sounds  and  random  noise  for  unvoiced  sounds).  Although 
not  mandatory,  inclusion  of  a real  pole  in  the  pulse  excitation  somewhat  alleviates  the 
tendency  toward  a nasal  quality  in  the  synthesized  voiced  sounds.  Likewise  slightly  pre- 
emphasized noise  assists  in  the  production  of  more  crisp  unvoiced  sounds. 


Parameter  Interpolation 

As in a 2400-bps LPE, parameters require interpolation during the intraframe period. The pitch period and unvoiced sounds require interpolation four times per frame, while the excitation power may be interpolated logarithmically and pitch-synchronously for voiced sounds.

The six partial correlation coefficients transmitted for unvoiced sounds need not be interpolated. On the other hand the formant frequencies transmitted for voiced sounds are interpolated pitch-synchronously. An important point is that there is no interpolation across voicing transitions, so that formant frequencies and power at the voiced onset (which is critical to the intelligibility) can be captured fully. It might be possible to further improve the initial second-formant frequency values by either retaining the previous value across unvoiced or silence intervals or using simple interpolation rules for predicting the second-formant initial value from its last known value, as has been previously suggested in the literature [36].
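These interpolation rules can be illustrated with a short sketch. It is not the report's implementation: the frame layout (a dictionary with the keys `voiced`, `power`, `formants`, and `parcor`), the function name, and the use of linear interpolation for formant frequencies are all assumptions. Power is interpolated on a logarithmic scale, the coefficients of unvoiced frames are passed through unchanged, and nothing is interpolated across a voicing transition.

```python
import math

def interpolate(prev, curr, t):
    """Blend two successive frames at fraction t in [0, 1].

    Frames are dicts with 'voiced' (bool) and 'power' (> 0); voiced frames carry
    'formants' (Hz), unvoiced frames carry 'parcor'. The layout is hypothetical.
    """
    if prev["voiced"] != curr["voiced"]:
        return dict(curr)                      # voicing transition: no interpolation
    out = dict(curr)                           # unvoiced PARCORs pass through as-is
    out["power"] = math.exp((1.0 - t) * math.log(prev["power"])
                            + t * math.log(curr["power"]))
    if curr["voiced"]:
        out["formants"] = tuple((1.0 - t) * f0 + t * f1
                                for f0, f1 in zip(prev["formants"], curr["formants"]))
    return out

# Hypothetical frames for illustration.
frame_a = {"voiced": True, "power": 1.0, "formants": (500.0, 1500.0, 2500.0)}
frame_b = {"voiced": True, "power": 100.0, "formants": (600.0, 1300.0, 2500.0)}
frame_u = {"voiced": False, "power": 4.0, "parcor": (0.1,) * 6}

midway = interpolate(frame_a, frame_b, 0.5)    # voiced to voiced: interpolated
onset = interpolate(frame_u, frame_b, 0.5)     # unvoiced to voiced: taken as-is
```

In a pitch-synchronous synthesizer the fraction `t` would be evaluated at each pitch pulse within the frame, so the number of interpolation points varies with the pitch period.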
EXPERIMENTATION

Three  important  tests  were  selected  to  illustrate  the  strengths  and  weaknesses  of  the 
600-bps  voice  digitizer: 

• The  diagnostic  rhyme  test  (DRT)  of  transmitted  voices  for  the  intelligibility  assessment, 

• The  spectral  analysis  of  synthesized  speech  for  the  visual  evaluation, 

• Transcription  of  a synthesized  speech  sample  on  a record  for  audition. 


Intelligibility  Test 

An important objective of the DRT [18] is to determine how speech perception is influenced by process parameters (the parameter update rate, the number of bits for each parameter, and the choice of parameters). The test not only provides a measure of intelligibility but also evaluates the discriminability of six distinctive features: voicing, nasality, sustention, sibilation, graveness, and compactness. The DRT word list comprises 448 monosyllabic rhyming word pairs in which the initial consonants differ by only a single feature.

Table 3 lists the DRT scores of the 600-bps voice digitizer. For comparison the DRT scores of the present Navy experimental 2400-bps LPE are also listed. The overall DRT score of the 600-bps voice digitizer is 79.9 percent, which is acceptable but not particularly high. For comparison a previous formant vocoder developed by Melpar [37] scored only 67 percent and required 1200 bps. Additional refinement of the 600-bps digitizer is in progress in the hope of improving the DRT score.


Table 3 — Summary of DRT Scores at 600 bps and, for Comparison, at 2400 bps

Feature        Perception                                    600-bps Voice    2400-bps
                                                             Digitizer        LPE

Voicing        Distinguishes /b/ from /p/, /d/ from /t/,     99.9             89.6
               /v/ from /f/, etc.
Nasality       Distinguishes /n/ from /d/, /m/ from /b/,     84.4             93.6
               etc.
Sustention     Distinguishes /f/ from /p/, /b/ from /v/,     78.1             77.0
               /t/ from /θ/, etc.
Sibilation     Distinguishes /s/ from /θ/, /ǰ/ from /d/,     60.2             93.2
               etc.
Graveness      Distinguishes /p/ from /t/, /b/ from /d/,     68.0             81.6
               /w/ from /r/, /m/ from /n/, etc.
Compactness    Distinguishes /y/ from /w/, /g/ from /d/,     88.3             93.0
               /k/ from /t/, /š/ from /s/, etc.
Average                                                      79.9             88.0

Spectral Analysis of Synthesized Speech

Spectral analysis by the sound spectrograph is a simple and convenient means of evaluating the formant tracking performance of the 600-bps voice digitizer. Figure 11 shows the spectrograms of the original and synthesized speech. This example was taken from a portion of speech on the phonograph record included with this report. The sentence contains many varieties of sound elements: vowels, consonants, vowel-like sounds (/r/ and /l/), a nasal sound (/n/), a voiced fricative (/ð/) and voiceless stops (HI). In comparison with spectrograms of previous formant vocoders (Fig. 10 of Ref. 4), the synthesized speech of the 600-bps voice digitizer gives remarkably faithful spectral patterns.


Demonstration  Record  of  Synthesized 
Speech  Samples 

The  phonograph  record  included  with  this  report  contains  several  examples  of  synthe- 
sized speech  at  600  bps.  Each  sample  is  composed  of  conversational  sentences.  The  listener 
may  decide  as  to  the  practicality  of  the  600-bps  voice  digitizer  for  voice  communications 
from  these  samples.  The  spoken  text  is  intentionally  not  given  in  this  report,  to  avoid  bias- 
ing the  listener. 


CONCLUSIONS 

This report has described a practical scheme whereby voice communication at a data rate of 600 bps is possible. The approach is attractive because the 600-bps voice digitizer is a simple extension of a 2400-bps linear predictive encoder which will be generally deployed by DOD and other government agencies. The 600-bps voice digitizer uses the output of the 2400-bps linear predictive encoder by converting its linear predictive coefficients to three formant frequencies and then matching the frequency patterns to preselected reference patterns for economical coded transmission.

Some speech quality is lost at 600 bps: the synthesized speech sounds nasal, is occasionally slurred, and lacks some of the normal speaker identification capability. However the 600-bps voice digitizer can produce synthesized speech that has adequate intelligibility for specialized military voice communications. Three areas now require further investigation: improvement of the intelligibility, reduction of the prevailing nasal quality, and evaluation of the performance under transmission-error conditions.

Fig. 11 — Spectral analysis of original speech and as synthesized by the 600-bps voice synthesizer